The Geography of Scientific Productivity: Scaling in U.S. Computer Science 
Here we extract the geographical addresses of authors in the Citeseer database of computer science 
papers. We show that the productivity of research centres in the United States follows a power-law 
regime, apart from the most productive centres for which we do not have enough data to reach 
definite conclusions. To investigate the spatial distribution of computer science research centres in 
the United States, we compute the two-point correlation function of the spatial point process and 
show that the observed power-laws do not disappear even when we change the physical representation 
from geographical space to cartogram space. Our work suggests that the effect of physical location 
poses a challenge to ongoing efforts to develop realistic models of scientific productivity. We propose 
that the introduction of a fine scale geography may lead to more sophisticated indicators of scientific 
output. 
I. INTRODUCTION 

In the last decade, the analysis of mankind's scien- 
tific endeavour has become a rapidly expanding interdis- 
ciplinary field. This has been mainly due to the advent of 
comprehensive online preprint servers and paper repos- 
itories, from which patterns of productivity and collab- 
oration networks of individual scientists can be readily 
ascertained. The vast amount of available data raises 
the hope that scientists and policy makers will soon be 
able to gain unprecedented insights into the location of 
research centres and their productivity. Indeed, little is 
known today about the influence that geographical location 
may have on "invisible colleges" [34]. 
Conversely, we are only just beginning to uncover how 
the historical growth of these "invisible colleges" gener- 
ates heterogeneities in the physical location of research 
centres and, therefore, of the scientists themselves. 

Previous investigations of bibliometric data by 
physicists have followed two main directions. On one 
hand, efforts have focused on characterizing the topological 
structure of collaboration networks. On the 
other, researchers have used tools of statistical physics to 
gain insight into the growth dynamics of scientific outputs. 
Despite this considerable progress, the 
relation of collaboration networks to the productivity of 
scientists depends on the still poorly understood fine ge- 
ographical location of research centres. 

Matia et al. approached the challenge of characterizing 
institutional productivity by analyzing 408 U.S. institutes 
for the 11-year period 1991-2001. They 
observe a bimodal distribution and conjecture that this 
is indicative of a clustering effect of institutes of two 
different size classes. 

The characterization of spatial structures at large geographical 
scales has a long tradition. In 1971, Glass and 
Tobler were the first to apply the radial distribution function 
(or two-point correlation function, as it is known in 
astrophysics [13]) to the study of cities on a part of the 
Spanish plateau. They choose a 40 mile square, 
homogeneous in town size and density, and apply con- 
cepts developed in the study of the statistical mechanics 
of equilibrium liquids. Although their analysis does not 
detect clustering, we would expect the two-point corre- 
lation function to reveal patterns of concentration and 
clustering in data whose population sizes vary over many 
orders of magnitude. 

Recently, Yook et al. showed that the nodes of the in- 
ternet are embedded on a fractal support driven by the 
fractal structure of the population worldwide [17]. This 
suggests that, in spatial networks with strong geographical 
constraints, the nodes may not be distributed randomly 
in space [18], but may be clustered as a function 
of population density. Further, Gastner and Newman 
presented an algorithm based on physical diffusion to 
draw density equalizing maps, or cartograms, in which 
the sizes of geographic regions appear in proportion to 
their population or some other property [19]. Cartograms 
give us a tool to probe into the dependence of one spa- 
tial variable (e.g. cancer occurrences) upon another (e.g. 
population). In particular, processes which are spatially 
clustered, but dependent on population densities, are expected 
to display random spatial distributions once the 
data are transformed by the cartogram [20]. 

In order to bring the productivity of research centres 
and their spatial interaction patterns under a single roof, 
we follow a different, but complementary approach to the 
ones presented above. Indeed, research centres are not 
homogeneously distributed in geographical space and it 
is likely that location will impact on their productivity 
and the structure of collaboration networks. However, to 
fully understand the role of location on the production 
of science and its networks, one must first characterize 
the underlying spatial processes, and this is the road we 
take here. We therefore investigate scientific productivity 
as a function of fine scale geographical location. Furthermore, 
to underpin these results, we characterize the 
spatial point process generated by the physical location 
of research centres. 

To investigate the role of fine scale geography in 
the production of science, one needs to analyze a 
large dataset. Traditional investigations of bibliomet- 
ric data have been carried out by analyzing databases 
like PubMed, arXiv.org or Thomson ISI. However, these 
databases suffer from drawbacks. Either the data contain 
only the address of the first (PubMed) or corresponding 
author (arXiv.org), or researchers are not 
uniquely associated with their addresses (Thomson ISI). 

A more promising source of data is the Citeseer digi- 
tal library, created in 1998 as a prototype of Autonomous 
Citation Indexing [21]. Citeseer locates computer science 
articles on the web in Postscript or PDF format and extracts 
citations from and to documents [22]. Citeseer has 
made its metadata available online, and the inclusion 
of address and affiliation fields for each author allows 
a first rigorous analysis into the geography of a very large 
bibliometric database. 



II. SPATIAL STRUCTURE 

We studied the Citeseer metadata, which contains 
716,772 records, some of which are repeated and some 
of which have authors with empty address fields. We 
considered the N = 379,111 (52.9%) unique papers for 
which Citeseer identifies all authors and their respective 
addresses. Out of these N unique papers, we analyzed 
the M = 128,348 ($p_{US}$ = 33.9%) papers which have one 
or more U.S. authors. Interestingly, $p_{US}$ is in reasonable 
accordance with Thomson ISI global indicators, which 
state that between 1997 and 2001, the United States output 
34.86% of the world's highly cited publications [23]. 

For each paper, we extracted the 5-digit ZIP code from 
each author's address field and geocoded this ZIP into a 
(latitude, longitude) pair of coordinates [36]. We identified 
ZIP codes from the address field by using regular 
expressions to match a five-digit code (plus the optional 
four-digit code, which we ignored) preceded or followed 
by a U.S. state (or its abbreviation) or the acronym USA. 
This will leave out addresses like "Roma 00185, Italy" or 
"Israel 84105", but will also fail to locate the address 
"Physics Department, Northeastern University, Boston MA USA", 
as it lacks a ZIP code. We restricted the analysis to the 48 
conterminous U.S. states plus the District of Columbia. 
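
The matching rule described above can be sketched in Python as follows. This is a minimal illustration, not the expressions used in the study: the abbreviated state list, the pattern itself and the function name extract_zips are assumptions made for the example.

```python
import re

# Illustrative subset only (an assumption of this sketch); a full run would
# list every conterminous state name and abbreviation, plus DC and USA.
STATES = ["MA", "Massachusetts", "PA", "Pennsylvania", "CA", "California",
          "MD", "Maryland", "DC", "USA"]
STATE = r"(?:%s)" % "|".join(sorted(map(re.escape, STATES), key=len, reverse=True))

# Five digits, with an optional ZIP+4 suffix that is matched but not captured.
ZIP = r"(\d{5})(?:-\d{4})?"

# Either "<state> <zip>" or "<zip> <state>", separated by spaces or commas.
ZIP_RE = re.compile(
    r"\b%s\b[\s,]*\b%s\b|\b%s\b[\s,]*\b%s\b" % (STATE, ZIP, ZIP, STATE)
)


def extract_zips(address):
    """Return the 5-digit ZIP codes that appear next to a state token or 'USA'."""
    return [m.group(1) or m.group(2) for m in ZIP_RE.finditer(address)]
```

With this pattern, a foreign postal code such as the one in "Roma 00185, Italy" is ignored because no state token is adjacent to it, and an address without any five-digit code yields an empty list.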

We identified a total of 116,771 distinct authors with 
a U.S. address. Out of these, 103,928 (89%) list a single 
ZIP code in their address, 10,579 (9.1%) belong to institutions 
located in two ZIP codes and 2,264 (1.9%) are 
located in three or more ZIP codes. 



Rank  ZIP    Fractional Count  Institution
1     15213  2343.36           Carnegie Mellon University
2     02139  1891.18           MIT
3     94305  1512.12           Stanford University
4     94720  1496.76           University of California, Berkeley
5     20742  1144.70           University of Maryland, College Park

TABLE I: Most productive ZIP codes and respective universities. 



A. Productivity of Research Centres 

To investigate the concept of scaling in publication out- 
put of academic research centres, we computed the prob- 
ability distribution of total paper output per ZIP code. 
We note that ZIP codes were not aggregated. If two 
research centres belonging to the same institution have 
addresses with distinct ZIP codes, we considered them 
as distinct centres. This has the disadvantage of possibly 
counting more than one research centre per institution 
(instead of aggregating both to the same institution). 
However, Citeseer covers scientific articles in the field of 
computer science and it would be the exception that one 
institution would have several geographically separated 
computer science centres. 

Our analysis identified 3,393 different ZIP codes that 
matched the U.S. Census Bureau tables. We implemented 
a version of fractional counting to compute the 
productivity of U.S. research centres. For every paper, 
we parsed each author's address field and extracted the 
ZIP codes therein (there may be more than one ZIP, if 
the author belongs to more than one U.S. institution). 
Each occurrence of a ZIP code in an address field of a 
paper increments the productivity of the research centre 
physically located at that ZIP code by $1/\phi$, where the 
normalization factor $\phi$ is computed as follows. For every 
address field in the paper being analyzed, we made 
$\phi := \phi + 1$ if the address contains no ZIP codes (i.e. it is a 
non-U.S. address), or $\phi := \phi + m$ if the address contains 
$m \geq 1$ ZIP codes (in which case that specific author will 
belong to m distinct U.S. institutions). 
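
The counting rule can be illustrated with a short Python sketch. The input layout (one list of ZIP codes per address field, empty for a non-U.S. address) and the function name fractional_counts are assumptions made for the example.

```python
from collections import defaultdict


def fractional_counts(papers):
    """Accumulate fractional paper counts per ZIP code.

    papers: list of papers; each paper is a list of address fields; each
    address field is the list of ZIP codes found in it (empty list for a
    non-U.S. address). This layout is hypothetical.
    """
    productivity = defaultdict(float)
    for addresses in papers:
        # phi grows by 1 for an address with no ZIP code and by m for an
        # address with m ZIP codes.
        phi = sum(len(zips) if zips else 1 for zips in addresses)
        for zips in addresses:
            for z in zips:
                productivity[z] += 1.0 / phi
    return dict(productivity)
```

For a paper with three address fields, one listing 15213, one foreign, and one listing both 02139 and 94305, the factor is 1 + 1 + 2 = 4, so each of the three ZIP codes receives 0.25 and the remaining 0.25 is attributed to the non-U.S. address.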

Identifying research centres by ZIP code has the ad- 
vantage of simplifying the data parsing algorithm, which 
is why we preferred this method to others based on ag- 
gregation by host institution. However, the method is an 
approximation, as it cannot distinguish between non-U.S. 
addresses. 

Table I displays the five most productive ZIP codes 
and their host institutions. Interestingly, the two most 
productive institutions, Carnegie Mellon University and 
MIT, are also the two most acknowledged entities, as 
shown by Giles and Councill in a previous study [21]. 

We then asked the question: what is the probability 
distribution of the research output of each research centre? 
To investigate this, we plot the probability density 
and cumulative distribution ($P[X > x] = \int_x^{\infty} p(y)\,dy$) 
in Figure 1. We found a bimodal probability distribution 
of research output by ZIP code (see Figure 1a), in agreement 
with a previous study of the Thomson ISI database 
by Matia et al. 

Our results suggest that this probability distribution 
displays power-law decay up to the "knee" where the 
regime changes. Data was insufficient to determine 
whether the upper tail of the distribution also decays as 
a power-law, albeit with a different exponent. This ob- 
servation is in apparent contradiction with the findings 
of Matia et al. who do not find a power-law regime. The 
authors examine the productivity of 408 U.S. institutes, 
whereas our method revealed that papers had been out- 
put at 3,393 U.S. institutes. Therefore, the power-law 
decay which we observed may be due to our methodol- 
ogy which included all research institutes in the meta- 
data. On the other hand, our analysis was limited in 
scope to the Citeseer database, whereas Matia et al. an- 
alyze the Thomson ISI dataset, hence comparisons with 
their wider study are necessarily inconclusive. Neverthe- 
less, our results raise the question of whether power-law 
decay only appears once one is able to identify a large 
percentage of all research institutes. 
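
For readers who wish to reproduce this kind of analysis, the two quantities involved (the empirical cumulative distribution and the Hill estimate of a tail exponent as a function of the number of upper order statistics k) can be sketched as below. This is a generic illustration, not the code used for Figure 1; the function names are assumptions.

```python
import math


def ccdf(values):
    """Empirical P[X > x] evaluated at each distinct value of the sample."""
    xs = sorted(values)
    n = len(xs)
    out = []
    for i, x in enumerate(xs):
        # Emit one point per distinct value, at its last occurrence.
        if i + 1 == n or xs[i + 1] != x:
            out.append((x, (n - i - 1) / n))
    return out


def hill_estimator(values, k):
    """Hill estimate of the tail exponent from the k largest observations."""
    xs = sorted(values, reverse=True)
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < len(values)")
    threshold = xs[k]  # the (k+1)-th largest value
    mean_log = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log
```

A Hill plot is obtained by evaluating hill_estimator over a range of k; a plateau suggests a stable exponent over that range of order statistics.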



B. The Pulling Power of Research Clusters 

A simple point process in $\mathbb{R}^2$ may be considered as a 
random countable set $X \subset \mathbb{R}^2$. The first moment of a 
point process can be specified by a single number, the intensity, 
$\rho$, giving the expected number of points per unit 
area. The second moment can be specified by Ripley's 
K function [16], where $\rho K(r)$ is the expected number 
of points within distance r of an arbitrary point of the 
pattern. 

The product density 

$$\rho_2(x_1, x_2)\, dA(x_1)\, dA(x_2) = \rho^2 g(r)\, dA(x_1)\, dA(x_2) \tag{1}$$

describes the probability to find a point in the area element 
$dA(x_1)$ and another point in $dA(x_2)$, at the distance 
$r = |x_1 - x_2|$, and g(r) is the two-point correlation 
function. Ripley's K function is related to g(r) by [20] 

$$K(r) = 2\pi \int_0^r g(r')\, r'\, dr' \tag{2}$$

In other words, g(r) is the density of K(r) with respect 
to the radial measure r dr. The benchmark of complete 
randomness is the spatial Poisson process, for which 
g(r) = 1 and $K(r) = \pi r^2$, the area of the search region 
for the points. Values larger than this indicate clustering 
on that distance scale, and smaller values indicate 
regularity. 
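
As a worked check of the relation between K and g, the Poisson benchmark follows by direct integration. The second line treats an idealized case, assumed here purely for illustration, in which g is a pure power law with constant c and exponent $0 < \gamma < 2$ at all scales.

```latex
% Poisson benchmark: g(r) = 1
K(r) = 2\pi \int_0^r r'\, dr' = \pi r^2

% Idealized pure power law (assumption): g(r) = c\, r^{-\gamma}, \quad 0 < \gamma < 2
K(r) = 2\pi c \int_0^r r'^{\,1-\gamma}\, dr' = \frac{2\pi c}{2-\gamma}\, r^{2-\gamma}
```

In the idealized case the expected number of neighbours within distance r grows as $r^{2-\gamma}$ rather than $r^2$; the exponent $2-\gamma$ is commonly read as a correlation dimension.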

FIG. 1: Probability distribution of paper output (fractional 
counts) per ZIP code. (a) The probability density is bimodal 
and can be approximated by a power-law regime between the 
two local maxima. (b) A least squares fit to the linear region 
of the cumulative distribution yields $\alpha$ = 1.55. The inset 
shows the Hill plot [25] as the number of upper order statistics, 
k, is varied. The match between the plateau on the Hill plot 
and the least squares fit (dashed horizontal line) shows that 
our estimate of $\alpha$ is appropriate. 

FIG. 2: a) Albers' equal-area projection of the 48 conterminous states of the US plus the District of Columbia. Research centres 
are identified by circles with area proportional to their productivity on a logarithmic scale. b)-d) Data in a) after a cartogram 
transformation with R&D expenditure by state (b), and population by state (c) and county (d), respectively. For each panel, 
we trace the 14,605 point border polygon used in the computation of the two-point correlation function. 

The two-point correlation function can be estimated 
from N data points $x \in D$ inside a sample window W by 

$$\hat{g}(r) = \frac{|W|}{N^2} \sum_{x \in D} \sum_{\substack{y \in D \\ y \neq x}} \frac{\Phi_r(x, y)}{2\pi r \Delta}\, \omega(x, y) \tag{3}$$

where $2\pi r \Delta$ is the area of the annulus centred at x with 
radius r and thickness $\Delta$. Here |W| is the area of the 
sample window, and the sum is restricted to pairs of different 
points $x \neq y$. The function $\Phi_r$ is symmetric in its 
arguments and $\Phi_r(x, y) = [r < d(x, y) < r + \Delta]$, where 
d(x, y) is the Euclidean distance between the two points 
and the condition in brackets equals 1 when true and 
0 otherwise. 

The function $\omega(x, y)$ accounts for a bounded W by 
weighting points where the annulus intersects the edges 
of W. There are a number of edge-corrections available, 
but that of Ripley [15] has a long tradition both in human 
geography and physics [2]: 

$$\omega(x, y) = \frac{2\pi r}{F(\partial B_r(x) \cap W)} \tag{4}$$

where $F(\partial B_r(x) \cap W)$ is the length of the part of the perimeter of 
the circle $B_r(x)$ with radius r = |x - y| around x that lies inside 
W, e.g. $F(\partial B_r(x) \cap W) = \pi r$ if only half of the annulus 
falls inside W. Note that $\omega(x, y) = 1$ iff $\partial B_r(x) \subset W$, 
in which case the summand in (3) is simply 
$\Phi_r(x, y)$ weighted by the area of the annulus centred at 
x with radius r and thickness $\Delta$. If $\partial B_r(x) \not\subset W$, 
that is, the circle $B_r(x)$ is only partially in the sample 
window W, then $\Phi_r(x, y)$ is weighted by the area of the 
fraction of the annulus which is inside W. 
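
A minimal Python sketch of an annulus-counting estimator of this kind is given below. It is a simplification under stated assumptions, not the implementation used in the study: the edge weight is set to 1 for every pair (no edge correction), the prefactor is taken as |W|/N^2 (N(N-1) is a common alternative), and the function name pair_correlation is hypothetical.

```python
import math


def pair_correlation(points, window_area, r, delta):
    """Estimate g(r) from ordered pairs with r < distance < r + delta.

    Assumptions of this sketch: no edge correction (weight 1 for every
    pair), normalization by N^2, and r > 0.
    """
    n = len(points)
    count = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue  # only pairs of different points contribute
            d = math.hypot(xi - xj, yi - yj)
            if r < d < r + delta:
                count += 1
    annulus_area = 2.0 * math.pi * r * delta
    return window_area * count / (n * n * annulus_area)
```

Because the edge weight is omitted, this sketch underestimates g(r) for points near the boundary of the window; the Ripley weight discussed above compensates for the part of each annulus that falls outside W.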

Of special physical interest is whether the two-point 
correlation is scale-invariant. A scale-invariant g (r) is 
an indicator of a fractal distribution of research centres, 
and is expected in critical phenomena [28]. 

To investigate the presence of power-law decay in the 
two-point correlation function we selected the 1,046 research 
centres (ZIP codes) which had a total fractional 
count of two papers or more. We chose this productivity 
threshold for two main reasons. First, we wished to consider 
only research centres which can be clearly identified 
as active. Second, the threshold keeps the computation of the 
two-point correlation function within reasonable computer resources, 
as W is a fine boundary of the United States, in our case 
a polygon with 14,605 points. 

Next, we projected the U.S. map and the 
(latitude, longitude) pairs of the research centres 
with the Albers' equal area projection [29, 37] and 
computed the two-point correlation function, g(r), of 
the resulting point process. 
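
For illustration, the spherical form of the Albers equal-area conic projection can be written in a few lines of Python. The standard parallels (29.5 and 45.5 degrees north), the origin (23 degrees north, 96 degrees west) and the Earth radius of 6371 km are conventional choices for the conterminous U.S. that are assumed here; the parameters actually used in the study are not stated in the text.

```python
import math


def albers(lat, lon, lat1=29.5, lat2=45.5, lat0=23.0, lon0=-96.0, radius=6371.0):
    """Spherical Albers equal-area conic projection; returns (x, y) in km.

    All default parameters are assumptions of this sketch.
    """
    phi, lam = math.radians(lat), math.radians(lon)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    phi0, lam0 = math.radians(lat0), math.radians(lon0)
    n = 0.5 * (math.sin(phi1) + math.sin(phi2))
    c = math.cos(phi1) ** 2 + 2.0 * n * math.sin(phi1)
    rho = radius * math.sqrt(c - 2.0 * n * math.sin(phi)) / n
    rho0 = radius * math.sqrt(c - 2.0 * n * math.sin(phi0)) / n
    theta = n * (lam - lam0)
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)
```

Because the projection preserves areas, Euclidean distances and annulus areas computed on the projected points are comparable across the whole map, which is what the estimation of g(r) requires.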

To investigate whether the decay of g(r) is a function 
of the distribution of R&D funding or population, we applied 
several cartogram transformations to the base map 
and the points: first, we computed the cartogram projection 
using U.S. R&D funding expenditure, by state, 
for the year 2001 [30, table B-17]; second, we computed 
the cartogram with U.S. population, by state and county, 
from the 2000 census. The points representing the research 
centres were transformed according to each cartogram. 
Figure 2a) shows the Albers' equal area projection, 
in which each centre is represented by a circle with area 
proportional to the number of papers output on a logarithmic 
scale. Figures 2b)-d) show the cartograms with 
R&D expenditure by state, and population by state and 
county, respectively. It is obvious from these maps that 
as the cartogram transformation uses finer spatial scales 
(e.g. from U.S. states to counties), the points become 
more homogeneously distributed spatially. 

The two-point correlation function computed for the 
projected data (see Figure 2a)) is plotted in Figure 3, where 
we observe a power-law decay $g(r) \sim r^{-\gamma}$ with $\gamma \approx 1.16$. 
Next we asked the following question: can the power-law 
decay of g(r) be explained by a clustering of research 
centres in areas where research funding or population 
is higher? To answer this question, we computed 
g(r) for the same point process, but now using the data 
transformed by the cartograms with R&D expenditure by 
state (Figure 2b)), population by state (Figure 2c)), and 
population by county (Figure 2d)). Our results showed 
that the power-law decay was still present after the cartogram 
projections, although as the transformation was 
performed at finer spatial scales, g(r) approached the expected 
value for a Poisson process, g(r) = 1, at shorter 
distances. 
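
An exponent of this kind can be estimated by ordinary least squares on log-transformed values. The sketch below is a generic illustration rather than the fitting procedure used for Figure 3; the function name fit_power_law is an assumption, and any data passed to it in an example would be synthetic.

```python
import math


def fit_power_law(rs, gs):
    """Least-squares fit of log g = log c - gamma * log r.

    Returns (gamma, c). All values of r and g must be positive.
    """
    xs = [math.log(r) for r in rs]
    ys = [math.log(g) for g in gs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, math.exp(my - slope * mx)
```

In practice the fit should be restricted to the range of r over which g(r) is approximately linear on a log-log plot, since g(r) flattens towards 1 at large distances.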





FIG. 3: Variation of the two-point correlation function with 
distance (Km). In blue, g (r) computed from projections of 
the border and the points with the Albers' equal area pro- 
jection. In red, green and black, g (r) computed from further 
transforming the data by the cartogram projection with R&D 
per state, population per state and county, respectively. The 
horizontal line at g = 1 is the expected value of g (r) for a 
Poisson process. 



III. DISCUSSION 

Considerable advances have been made over the past 
few years in understanding the structure of scientific pro- 
duction and its networks. Along this road, physicists 
have computed a number of quantities to characterize 
networks of scientific collaborations, mainly by analyz- 
ing data from online preprint servers and repositories. 
However, these studies have not addressed the impact of 
fine scale physical location on the statistical characterization 
of the scientific enterprise and its networks. Here 
we have presented a detailed study of the productivity 
of research centres in U.S. computer science (identified 
by ZIP codes) and characterized the pattern of spatial 
concentration which these centres display. 

A first important conclusion of our study is that the 
productivity of U.S. research centres in computer science 
was highly skewed. A surprising result of our study was 
the power-law decay of the probability distribution of re- 
search output for some orders of magnitude. A second 
important conclusion is that the physical location of re- 
search centres in the U.S. formed a fractal set, which 
was not completely destroyed by population or research 
funding patterns. 

Although we consider our results to be promising, there 
are still several caveats. First, our conclusions are clearly 
only valid for the U.S. and, even for the Citeseer 
database, which we consider the best currently available 
for such analysis, there are problems of missing and 
inaccurate data which we are not able to quantify. Nevertheless, 
our results are consistent with those from the 
burgeoning geography of information technology, which 
suggests, in qualitative fashion, that such technologies are 
correlated with population but also have their own dynamic 
[32, 33]. In this sense, our result that the scaling 
inherent in the geographical distribution of paper production 
in U.S. computer science is still present once the 
geography has been normalized with respect to the distribution 
of population and R&D expenditures implies processes 
that are endogenous to the dynamics of research. 

In summary, the method introduced in this paper could 
serve as a starting point for an investigation of the role 
of the fine scale physical location of research centres in 
the production of science. Our study focused on U.S. 
computer science but further analyses should be possi- 
ble as preprint server repositories make more elaborate 
metadata available. And such developments may lead to 
a better understanding of the role of physical location 
not just in science, but for a much wider class of complex 
spatial systems. 